{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Some informations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<hr />\n",
    "source: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data\n",
    "\n",
    "**Data Description**\n",
    "\n",
    "In this competition, you will predict the probability that an auto insurance policy holder files a claim.\n",
    "\n",
    "In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target columns signifies whether or not a claim was filed for that policy holder."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<hr />\n",
    "source: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction#evaluation\n",
    "        \n",
    "Submissions are evaluated using the Normalized Gini Coefficient.\n",
    "\n",
    "During scoring, observations are sorted from the largest to the smallest predictions. **Predictions are only used for ordering observations; therefore, the relative magnitude of the predictions are not used during scoring.** The scoring algorithm then compares the cumulative proportion of positive class observations to a theoretical uniform proportion.\n",
    "\n",
    "The Gini Coefficient ranges from approximately 0 for random guessing, to approximately 0.5 for a perfect score. The theoretical maximum for the discrete calculation is (1 - frac_pos) / 2.\n",
    "\n",
    "The Normalized Gini Coefficient adjusts the score by the theoretical maximum so that the maximum score is 1."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "source: https://www.kaggle.com/c/ClaimPredictionChallenge/discussion/703\n",
    "     \n",
    "```Python\n",
    "def gini(actual, pred, cmpcol = 0, sortcol = 1):\n",
    "     assert( len(actual) == len(pred) )\n",
    "     all = np.asarray(np.c_[ actual, pred, np.arange(len(actual)) ], dtype=np.float)\n",
    "     all = all[ np.lexsort((all[:,2], -1*all[:,1])) ]\n",
    "     totalLosses = all[:,0].sum()\n",
    "     giniSum = all[:,0].cumsum().sum() / totalLosses\n",
    " \n",
    "     giniSum -= (len(actual) + 1) / 2.\n",
    "     return giniSum / len(actual)\n",
    " \n",
    " def gini_normalized(a, p):\n",
    "     return gini(a, p) / gini(a, a)\n",
    " \n",
    " def test_gini():\n",
    "     def fequ(a,b):\n",
    "         return abs( a -b) < 1e-6\n",
    "     def T(a, p, g, n):\n",
    "         assert( fequ(gini(a,p), g) )\n",
    "         assert( fequ(gini_normalized(a,p), n) )\n",
    "     T([1, 2, 3], [10, 20, 30], 0.111111, 1)\n",
    "     T([1, 2, 3], [30, 20, 10], -0.111111, -1)\n",
    "     T([1, 2, 3], [0, 0, 0], -0.111111, -1)\n",
    "     T([3, 2, 1], [0, 0, 0], 0.111111, 1)\n",
    "     T([1, 2, 4, 3], [0, 0, 0, 0], -0.1, -0.8)\n",
    "     T([2, 1, 4, 3], [0, 0, 2, 1], 0.125, 1)\n",
    "     T([0, 20, 40, 0, 10], [40, 40, 10, 5, 5], 0, 0)\n",
    "     T([40, 0, 20, 0, 10], [1000000, 40, 40, 5, 5], 0.171428,\n",
    "       0.6)\n",
    "     T([40, 20, 10, 0, 0], [40, 20, 10, 0, 0], 0.285714, 1)\n",
    "     T([1, 1, 0, 1], [0.86, 0.26, 0.52, 0.32], -0.041666,\n",
    "       -0.333333)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "source: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/40222#225669\n",
    "\n",
    "Hi, Daniel!\n",
    "\n",
    "These names refer to the origin of variable, but there is no reason to concern about it. \n",
    "* **\"ind\"** is related to individual or driver, \n",
    "* **\"reg\"** is related to region, \n",
    "* **\"car\"** is related to car itself and \n",
    "* **\"calc\"** is an calculated feature.\n",
    "\n",
    "The features and rows are not time dependent, so \"ps_ind_01\" and \"ps_ind_02\" are just labels to hide real names.\n",
    "\n",
    "Unfortunately I can't share the real meanings of variables.\n",
    "\n",
    "Thanks for joining!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Baseline model - 0.252"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.model_selection import StratifiedShuffleSplit\n",
    "from sklearn.preprocessing import Imputer\n",
    "from sklearn.preprocessing import LabelBinarizer\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.metrics import make_scorer\n",
    "from sklearn_pandas import DataFrameMapper\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import regex as re\n",
    "\n",
    "# métrica utilizada na competição\n",
    "def gini(actual, pred, cmpcol = 0, sortcol = 1):\n",
    "     all = np.asarray(np.c_[ actual, pred, np.arange(len(actual)) ], dtype=np.float)\n",
    "     all = all[ np.lexsort((all[:,2], -1*all[:,1])) ]\n",
    "     totalLosses = all[:,0].sum()\n",
    "     giniSum = all[:,0].cumsum().sum() / totalLosses\n",
    " \n",
    "     giniSum -= (len(actual) + 1) / 2.\n",
    "     return giniSum / len(actual)\n",
    " \n",
    "def gini_normalized(a, p):\n",
    "    return gini(a, p) / gini(a, a)\n",
    "\n",
    "gini_norm_scorer = make_scorer(gini_normalized)\n",
    "\n",
    "# dataset de treino\n",
    "df = pd.read_csv(\"~/ps_kaggle_data/train.csv\", na_values='-1', index_col='id')\n",
    "\n",
    "# extrai metadados dos nomes das features\n",
    "col_pattern = re.compile(\"^ps_(?P<class>\\w+)_(?P<number>\\d+)(_(?P<type>\\w+))?$\")\n",
    "ci = pd.DataFrame({column: col_pattern.search(column).groupdict() for column in df.columns[1:]}).T\n",
    "ci.loc[ci.type.isnull(), 'type'] = 'num'\n",
    "\n",
    "# realiza preprocessamento sobre as features\n",
    "# features numéricas -> mean imputer\n",
    "# features binárias -> mode imputer\n",
    "# features categóricas -> mode imputer + label binarizer\n",
    "mapper = DataFrameMapper([([col], Imputer(strategy='mean')) for col in ci[ci.type == 'num'].index.tolist()] +\\\n",
    "                         [([col], Imputer(strategy='most_frequent')) for col in ci[ci.type == 'bin'].index.tolist()] +\\\n",
    "                         [([col], [Imputer(strategy='most_frequent'), LabelBinarizer()]) for col in ci[ci.type == 'cat'].index.tolist()])\n",
    "\n",
    "df_final = mapper.fit_transform(df.drop('target', axis=1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 10 folds for each of 1 candidates, totalling 10 fits\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  1.4min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=StratifiedShuffleSplit(n_splits=10, random_state=100, test_size=0.05,\n",
       "            train_size=None),\n",
       "       error_score='raise',\n",
       "       estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),\n",
       "       fit_params=None, iid=True, n_jobs=1, param_grid={},\n",
       "       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n",
       "       scoring=make_scorer(gini_normalized), verbose=1)"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# modelo regressão linear\n",
    "ln = LinearRegression()\n",
    "# particionador de dataset em 10 splits, mantendo a proporção das classes, com test_size de tamanho 10%\n",
    "sss = StratifiedShuffleSplit(n_splits=10, test_size=.1, random_state=100)\n",
    "param_grid = {}\n",
    "\n",
    "cv = GridSearchCV(ln, cv=sss, param_grid=param_grid, scoring=gini_norm_scorer, verbose=2)\n",
    "cv.fit(df_final, df.target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "# treina o melhor modelo\n",
    "model = cv.best_estimator_\n",
    "model.fit(df_final, df.target)\n",
    "\n",
    "# executa sobre os dados de treino\n",
    "test = pd.read_csv(\"~/ps_kaggle_data/test.csv\", na_values='-1', index_col='id')\n",
    "test_final = mapper.fit_transform(test)\n",
    "predictions = model.predict(test_final)\n",
    "\n",
    "# escala a predição pro range [0, 1]\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "mm = MinMaxScaler()\n",
    "predictions_0_1 = mm.fit_transform(predictions.reshape((predictions.shape[0], 1)))\n",
    "\n",
    "# cria arquivo para submissão\n",
    "pd.DataFrame(predictions_0_1, index=test.index, columns=['target']).to_csv(\"~/prediction.csv\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
